EDA Project

The EDA project in this course has four main parts to it:
1. Project Proposal (complete)
2. Phase 1
3. Phase 2
4. Report

This notebook will be used for the Project Proposal, Phase 1, and Phase 2. You will have specific questions to answer within this notebook for the Project Proposal and Phase 1. You will also continue using this notebook for Phase 2; however, guidance and expectations for that assignment can be found on Canvas. The report is completed outside of this notebook (delivered as a PDF). Detailed instructions for that assignment are provided in Canvas.
Read this before proceeding:

1. Review the list of data sets and sources of data to avoid before choosing your data. This list is provided in the instructions for the Project Proposal assignment in Canvas.

2. It is expected that when you are asked questions requiring typed explanations you are to use a markdown cell to type your answers neatly. Do not provide typed answers to questions as extra comments within your code. Only provide comments within your code as you normally would, i.e. as needed to explain or remind yourself what each part of the code is doing.

Project Proposal

The intent of this assignment is for you to share your chosen data file(s) with your instructor and provide general information on your goals for the EDA project.
Step 1 (2 pts): Give a brief description of the source(s) of your data and include a direct link to your data.

Step 1 - Answer

Data sources and datasets include the following:

| Source | Focus | Description | URL Reference | Data Format | Geographic Coordinates (Y/N) | Key Required (Y/N) | Sample Data File Link |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Geotab | Device Status Info | Current state of vehicle data transfer | Link | json | Y | Y | GitHub Link |
| Geotab | Trip Summary | Vehicle trip summary based on input criteria (device identifier and date range search) | Link | json | N | Y | GitHub Link |
| Geotab | GPS Log Record | Individual GPS log records based on input criteria (device identifier and date range search) | Link | json | Y | Y | GitHub Link |
| Geotab | Exception Events | Individual GPS exception records based on a predefined "rules" specification for street sweeping | Link | json | N | Y | GitHub Link |
| City of Vancouver Open Data | Streets | Public streets (Arterial, Secondary Arterial, Collector); the dataset is categorized by the type of street segment. For the purposes of this project, the street segments to be analyzed will exclude residential streets and pathways | Link | geojson | Y | N | GitHub Link |
| City of Vancouver Open Data | Bikeways | Bicycle lanes (either separated lanes or officially designated road infrastructure). As "nearly 13% of commuting trips in Vancouver are by bike" (City of Vancouver), it is important to report on public work on bikeway infrastructure | Link | geojson | Y | N | GitHub Link |

The scope of this work also includes building the data access modules to access and assemble the data on a weekly basis.

The parameter file that configures the data pull and data joins includes an option to either pull the data directly from the source datasets or use the files stored in the project working directory's data folder. For the purposes of this project, the user has both options available, as temporary credentials are included within the project files.

Step 2 (2 pts): Briefly explain why you chose this data.

Step 2 Answer

I have chosen to focus on accessing and performing exploratory data analysis on this data for several reasons, both personally and professionally.

Professionally
A number of my current projects rely on integrating our vehicle location and telemetry data with Geographic Information Systems (GIS) data to identify work areas and report work-completion statistics. Accessing this information has been challenging because these systems are hosted on distributed vendor cloud infrastructure. We have not yet had an opportunity to build a proof of concept that aggregates the vehicle GPS data to report on street-based work-completion statistics.

Personally
I have not yet had the opportunity to build a data feed to a live, big-data source like this one. It is exciting to push myself to understand how this API works and to build the connection framework that will allow us to access the data we need. I also have no prior experience with GitHub; my plan is to store and access my code using this repository tool.

The information contained within this data is an essential aspect of several initiatives and would allow me to address many long-standing issues, including:

  1. Technical challenges: A lack of technical ability to access the hosted data using the Geotab API has limited my work in this area. This course project is my motivation to further my understanding of accessing API-based datasets using Python.
  2. COVID Delays: The City of Vancouver council prioritized Public Realm Maintenance and Cleanliness in late 2019, and Engineering Services was tasked with an increased mandate for public realm cleaning services. The City's COVID response impacted this mandate in 2020 and early 2021. Engineering is now focusing efforts on addressing the issues identified by city council and accurately reporting the service status of cleaning efforts across the city. This project will act as a proof of concept for those reporting requirements.
  3. Data Accuracy: Vehicle GPS relies on connections with at least four (4) satellites to triangulate a vehicle's position. The system we are using (Geotab) currently provides data with approximately 3-meter accuracy at a 99% confidence level. A challenge for urban centers occurs when vehicles travel in corridors of tall buildings and the GPS signal to the vehicle is degraded or lost; this effect is known as an urban canyon.

A screen display of vehicle travel by street sweepers downtown is included:

The urban canyon effect can be seen in the vehicle travel downtown, resulting in challenges in reporting vehicle completion results. My work will focus on using traditional GIS analysis techniques (such as spatial aggregation, buffering, and spatial joins) to build a dataset that removes some of the ambiguity from the raw GPS data.

Step 3 (1 pt): Provide a brief overview of your goals for this project.

Step 3 Answer

Business Objectives


OBJ 1 - create a stable data feed connection and format for weekly street sweeping data summary reports
OBJ 2 - clean and format datasets in order to generate a weekly business report
OBJ 3 - combine the detailed GPS trip location points with non-spatial data (travel speed, equipment telemetry)
OBJ 4 - generate a geographic buffer (for example, 10 meters) around major street segments and bike lanes
OBJ 5 - using GIS methods (spatial join), associate each street sweeping trip with the nearest street segment or bike lane buffer based on the minimum length of travel (expected minimum travel distance: 20 meters)
OBJ 6 - generate summary statistics on the street sweeping work by identifying the street segments with the most street sweeping

Technology Goals

Exploratory Data Analysis (EDA) Goals


Data Processing - generate data access methods to acquire and process the required datasets
Model Fit - create a data model appropriate for the business objectives
Data Cleanliness - identify data with high variability (or missing data) and develop appropriate methods to address those issues
Data Quality - generate metrics on data quality measures to identify opportunities to further improve the data model

Within the GitHub project, Python files generate the connections to the data sources and build the data model, including the following files: street_sweeping.py, geotab_testing.py, and assemble_data.py.

Within these files, several functions transform the information from JSON outputs to somewhat cleaned and normalized dataframes, which can address the business objectives.

Running the cell below (or calling the street_sweeping.py file directly) will initiate the following:

  1. a system check on the modules currently installed within the Python environment, to ensure the user can run the code successfully
  2. a Geotab credentials (username and password) file check to determine whether the user should have the choice to pull the data from the stored data files or from the source API
  3. if the system checks are valid, the user is given options on the data source and the amount of information displayed while importing the dataset
  4. when the dataset has been imported, the user will see a map display summarizing the mostly-raw dataset; some data cleaning and business-logic data assembly are built into the initial "get" commands
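As an illustrative sketch only (not the project's actual street_sweeping.py), the startup checks described above might look like the following; the module list and credentials path are assumptions:

```python
# Sketch of the startup checks: verify required modules are installed,
# then decide whether the live-API option can be offered.
import importlib.util
from pathlib import Path

REQUIRED_MODULES = ["pandas", "geopandas", "plotly"]   # assumed requirements
CREDENTIALS_PATH = Path("credentials/geotab.dat")      # assumed location

def modules_available(names):
    """Return the subset of required modules that are missing."""
    return [name for name in names if importlib.util.find_spec(name) is None]

def can_use_live_api(cred_path):
    """The live-API option is only offered when a credentials file exists."""
    return cred_path.is_file()

missing = modules_available(REQUIRED_MODULES)
if missing:
    print(f"Missing modules: {missing}")
elif can_use_live_api(CREDENTIALS_PATH):
    print("Choose data source: stored files or live Geotab API")
else:
    print("No credentials found; using stored data files")
```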

Please Note: the required credentials to access the Geotab API have been included in the project folder (in the credentials folder, geotab.dat file). The user credentials will automatically expire August 31, 2021, and will only allow access to street sweeper data.

Step 4 (1 pt): Read the data into this notebook.

Sample output from geodataframe using plotly

Step 5 (1 pt): Inspect the data using the info( ), head( ), and tail( ) methods.
STOP HERE for your Project Proposal assignment. Submit your (1) original data file(s) along with (2) the completed notebook up to this point, and (3) the html file for grading and approval.
Instructor Feedback and Approval (3 pts): Your instructor will provide feedback in either the cell below this or via Canvas. You can expect one of the following point values for this portion.
3 pts - if your project goals and data set are both approved.
2 pts - if your data set is approved but changes to your project goals (Step 3) are needed.
1 pt - if your project goals are approved but your data set is not approved.
0 pts - if neither your data set nor your project goals are approved.

As needed, follow your instructor's feedback and guidance to get on track for the remaining portions of the EDA project.

EDA Phase 1

The overall goal of this assignment is to take all necessary steps to inspect the quality of your data and prepare the data according to your needs. For information and resources on the process of Exploratory Data Analysis (EDA), you should explore the EDA Project Resources Module in Canvas. Once you’ve read through the information provided in that module and have a comfortable understanding of EDA using Python, complete steps 6 through 10 listed below to satisfy the requirements for your EDA Phase 1 assignment. **Remember to convert code cells provided to markdown cells for any typed responses to questions.**
Step 6 (2 pts): Begin by elaborating in more detail from the previous assignment on why you chose this data.
1. Explain what you hope to learn from this data.
2. Do you have a hunch about what this data will reveal? (The answer to this question will be used in the Introduction section of your EDA report.)

I expect to identify a number of trends within the dataset:

  1. Service Frequency: what areas of the city are being serviced with high frequency, and conversely, what areas of the city have a low frequency of service?
  2. Asset Use: are there vehicles with substantially higher usage rates?
  3. Shift Efficiency: are there shifts with very low street sweeping rates?

Because of ongoing societal issues within Vancouver's Downtown Eastside, I expect that a large amount of work effort and street sweeping events occur within and adjacent to the "east side". The remaining work areas should have a relatively equal distribution of street sweeping, with possibly more focus on seasonal hotspots, including tourist attractions, parks, and beach areas.

I anticipate that the distance of street sweeping will be evenly distributed across weekdays and will likely skew away from afternoon shifts. The amount of street sweeping depends on street lanes being accessible to the equipment; the volume of traffic and deliveries along arterial street segments limits the ability of the sweepers to work during high-volume periods.

Step 7 (2 pts): Discuss the population and the sample:
1. What is the population being represented by the data you’ve chosen?
2. What is the total sample size?

Population: The population for this project is limited to ten (10) City of Vancouver street sweepers.
Operating Hours: The Street Cleaning unit operates over three daily shifts; the business unit is in operation 24 hours per day, seven days per week.

While shifts start at specific times, crews generally require one hour after the start of a shift to leave the work yard area; there are a number of required pre-work meetings (safety, work-area-specific focus) and vehicle pre-trip inspections.

Data Challenge 1: For street sweeping work that occurs between 10am and 2pm, it is almost impossible to differentiate between the Day and Afternoon shifts without integrating this dataset with SAP, which is beyond the scope of this project and would contravene a number of City Technology Services policies regarding data privacy. As a result, the following assumption is used for categorizing street sweeping shifts.
Assumption for Data Challenge 1: Shift Association with Trip Times
  • If the trip begins between 7am and 2pm, it is associated with Day shift
  • If the trip begins between 2pm and 8pm, it is associated with Afternoon shift
  • If the trip begins between 8pm and 7am, it is associated with Night shift
Business Rule 1: For Night shift trips occurring between midnight (12am) and 7am, the work day is the day on which the shift began, which is the day before the trip. For example, a trip starting at 2am on a Tuesday is counted toward Monday's Night shift.
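The shift assumption and Business Rule 1 can be sketched in Python; the function names below are illustrative, not the project's actual code:

```python
# Map a trip start time to a shift, and a shift to its work day,
# following the assumption and Business Rule 1 stated above.
from datetime import datetime, date, timedelta

def assign_shift(trip_start: datetime) -> str:
    """Day: 7am-2pm, Afternoon: 2pm-8pm, Night: 8pm-7am."""
    h = trip_start.hour
    if 7 <= h < 14:
        return "Day"
    if 14 <= h < 20:
        return "Afternoon"
    return "Night"

def shift_work_day(trip_start: datetime) -> date:
    """Night trips before 7am belong to the previous calendar day."""
    if assign_shift(trip_start) == "Night" and trip_start.hour < 7:
        return (trip_start - timedelta(days=1)).date()
    return trip_start.date()
```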

Total Sample Size: the data sample for this project is limited to a single working week (seven days), which begins Sunday at midnight (12am) and concludes Saturday at 11:59pm.

Typically, this sample period includes approximately 400 vehicle trips, with approximately 80% of municipal street sweepers in operation.

Step 8 (2 pts): Describe how the data was collected. For example, is this a random sample? Are sampling weights used with the data?

There are two main methods of data collection for the data files used in this project:

Reference:
Curve Logic - Geotab
As Built Description

Step 9 (4 pts): In the Project Proposal assignment you used the info( ) method to inspect the variables, their data types, and the number of non-null values. Using that information as a guide, provide definitions of each of your variables and their corresponding data types, i.e. a data dictionary. Also indicate which variables will be used for your purposes.

I have built a function that combines the standard pandas functions pd.DataFrame.info and pd.DataFrame.describe with a metadata table in Excel where the analysis and feature descriptions are stored. The output includes the dataframe name and column details (including the number of non-null records and unique values, as well as summary statistics for numeric variables).
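A simplified sketch of such a combined info()/describe() data-dictionary helper (the Excel metadata join is omitted, and the function name is an assumption):

```python
# Build a per-column data dictionary: dtype, non-null count, unique
# values, plus mean/min/max for numeric columns.
import pandas as pd

def data_dictionary(df: pd.DataFrame, name: str) -> pd.DataFrame:
    rows = []
    for col in df.columns:
        s = df[col]
        row = {
            "dataframe": name,
            "column": col,
            "dtype": str(s.dtype),
            "non_null": s.notna().sum(),
            "unique": s.nunique(),
        }
        if pd.api.types.is_numeric_dtype(s):
            row.update(mean=s.mean(), min=s.min(), max=s.max())
        rows.append(row)
    return pd.DataFrame(rows)
```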

Step 10 (10 pts): For full credit in this problem you'll want to take all necessary steps to report on the quality of the data and clean the data accordingly. Some things to consider while doing this are listed below. Depending on your data and goals, there may be additional steps needed beyond those listed here.
1. Are there rows with missing or inconsistent values? If so, eliminate those rows from your data where appropriate.
2. Are there any outliers or duplicate rows? If so, eliminate those rows from your data where appropriate. At each stage of cleaning the data, state how many rows were eliminated.
3. Are you using all columns (variables) in the data? If not, are you eliminating those columns?
4. Consider some type of visual display such as a boxplot to determine any outliers. Do any outliers need to be removed? If so, how many were removed? At each stage of cleaning the data, state how many rows were eliminated.
It is good practice to get the shape of the data before and after each step in cleaning the data and add typed explanations (in separate markdown cells) of the steps taken to clean the data.
Include the rest of your work below and insert cells where needed.

Summary Details for Datasets

Generate the summary details for the datasets - build the data dictionary and summary visualizations

Modify and limit the contents of the key datasets

Focusing on the two main geodataframes, trips_gdf and streets_gdf

Review key data visualization outputs

Reviewing the output of the visualizations, a number of data clean-up tasks need to be done

Data Formatting (Date/Time)

  1. If the data is loaded directly from the API, the box plot for date/time-based data does not display
  2. If the data is loaded from the json file, the date/time values are numeric and will need to be converted to datetime
  3. The service provider uses UTC time; in order for the date/time to align with other business details, some of the date/time columns are updated in the data collection script, and the remaining columns will need to be updated
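A hedged sketch of the date/time fixes described above: numeric values from the json file converted to datetimes, then shifted from UTC to local time. The column names and the epoch-milliseconds unit are assumptions, not the project's actual schema:

```python
# Convert numeric (epoch-ms) or string timestamps to tz-aware datetimes,
# then align UTC values with Vancouver local business hours.
import pandas as pd

def fix_trip_times(df: pd.DataFrame, cols=("start", "stop")) -> pd.DataFrame:
    df = df.copy()
    for col in cols:
        if pd.api.types.is_numeric_dtype(df[col]):
            # json-loaded values arrive as epoch milliseconds (assumed unit)
            df[col] = pd.to_datetime(df[col], unit="ms", utc=True)
        else:
            df[col] = pd.to_datetime(df[col], utc=True)
        df[col] = df[col].dt.tz_convert("America/Vancouver")
    return df
```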

Assemble the Geospatial Data


Configuring the arterial streets to act as "containers" for the vehicle trip records

  1. In order to perform geospatial analysis (spatial join, buffer, clip) on the datasets, the default coordinate system for both files needs to be converted from a geographic coordinate system to the City of Vancouver standard, Universal Transverse Mercator (UTM) Zone 10 North (epsg:26910)
    • first, we need to set the geographic coordinate system for each geodataframe; the epsg reference for the geographic coordinate system is 4326
    • second, we need to transform the data from the geographic coordinate system to the City of Vancouver standard projected coordinate system, UTM Zone 10 N
  2. Generate a 15 meter buffer around each street segment within geopandas
  3. Reconfigure the geodataframe for future analysis
  4. Review the data updates by displaying the information in a map* output
Configuring the trip details to be associated with the street segments
  1. Convert the trip segments from the geographic coordinate system to the City of Vancouver standard UTM Zone 10 North (epsg:26910)
  2. Clip the trip segments to the buffered street segments
  3. Explode the trip summary geometry from a multiline geometry record to individual records for each segment of the trip that is contained within the arterial street area
  4. Calculate the length of each trip segment that has been clipped
  5. Filter out the trip segments that are shorter than 20 meters in length
  6. Join the trip segments and the arterial street buffers using the geopandas spatial join function, with the identity overlay option
  7. Clean the trip attribute street name and hundred-block details, to account for short segments generated by intersecting street buffers within the clip and overlay process

* Map Notes: in order for the information to be displayed against a mapbox basemap, the geometry data must be transformed back into the geographic coordinate system.
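The reprojection and buffering steps above can be sketched with geopandas, using synthetic stand-in geometry rather than the actual City of Vancouver streets file:

```python
# Reproject a street layer from geographic coordinates (EPSG:4326) to
# UTM Zone 10 N (EPSG:26910), then buffer each segment by 15 meters.
import geopandas as gpd
from shapely.geometry import LineString

# Stand-in "street segment" in lon/lat (EPSG:4326)
streets_gdf = gpd.GeoDataFrame(
    {"hblock": ["100 MAIN ST"]},
    geometry=[LineString([(-123.10, 49.28), (-123.10, 49.29)])],
    crs="EPSG:4326",
)

# Step 1: reproject to the City of Vancouver standard, UTM Zone 10 N
streets_utm = streets_gdf.to_crs(epsg=26910)

# Step 2: generate a 15 m buffer around each street segment
street_buffers = streets_utm.copy()
street_buffers["geometry"] = streets_utm.geometry.buffer(15)
```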

References:

https://spatialreference.org/ref/epsg/nad83-utm-zone-10n/
https://geopandas.org/gallery/plot_clip.html
https://geopandas.org/docs/user_guide/set_operations.html

Configure Streets Data

Because the output of the buffer operation for each street is a GeoSeries, the new polygon geometry needs to be added to the streets geodataframe, and the dataframe will need to be reconfigured to contain only the polygon geometry.

Configure Trips Data

Geometric manipulation of the Clipped Trip segments
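The clip, explode, and length-filter steps from the trip-configuration list above can be sketched as follows, with synthetic geometry; the 20 m threshold follows the text:

```python
# Clip trip lines to the buffered street area, explode multi-part
# results into individual segments, and drop segments under 20 m.
import geopandas as gpd
from shapely.geometry import LineString, Polygon

trips_gdf = gpd.GeoDataFrame(
    {"trip_id": [1]},
    geometry=[LineString([(-30, 5), (50, 5)])],
    crs="EPSG:26910",
)
buffers = gpd.GeoDataFrame(
    geometry=[Polygon([(0, 0), (40, 0), (40, 10), (0, 10)])],
    crs="EPSG:26910",
)

# Clip trips to the buffered street area
clipped = gpd.clip(trips_gdf, buffers)

# Explode any multi-part results into one row per segment
segments = clipped.explode(index_parts=False)

# Drop segments shorter than the 20 m minimum travel distance
segments = segments[segments.geometry.length >= 20]
```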

Attribute updates for Trip Dataset

Display the trip line segments that have been clipped by the arterial street segment buffer zone

Perform geometric overlay

We will overlay the clipped trip segments and the street buffers to return the address details for each trip. For more details on the geopandas overlay function, please refer to the Geopandas Documentation.

There are a number of overlay options: 'intersection', 'union', 'identity', 'symmetric_difference', and 'difference'. The option most suited to this analysis is "identity".
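A minimal sketch of the overlay call with the "identity" option, using synthetic stand-in layers (the column names are illustrative):

```python
# "identity" keeps all geometry from the first layer and attaches the
# second layer's attributes wherever the two overlap.
import geopandas as gpd
from shapely.geometry import LineString, Polygon

trips_clipped = gpd.GeoDataFrame(
    {"trip_id": [1]},
    geometry=[LineString([(0, 5), (20, 5)])],
    crs="EPSG:26910",
)
street_buffers = gpd.GeoDataFrame(
    {"hblock": ["100 MAIN ST"]},
    geometry=[Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])],
    crs="EPSG:26910",
)

trips_with_streets = gpd.overlay(trips_clipped, street_buffers, how="identity")
```

The portion of the trip inside the buffer receives the hblock value; the portion outside keeps its geometry with a null hblock, which is why the attribute cleaning step that follows is needed.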

Review the results in the map window, organized by street names. If you zoom into the downtown area, you will see a number of areas where the street names do not match the street name observed for the vehicle's travel direction.

For a detailed example that will be used after the dataset has been cleaned, we will use a random trip segment from the dataframe for review.

In this example, we can see a number of discrepancies for the trip, where the street name and the hundred-block details contain the street of vehicle travel as well as the details of the intersecting street segment buffer. This will cause issues for summary reporting.

Cleaning the Spatial Attributes

The summary trip information is valid; we will build a grouped results table that returns the unique records for each trip segment.

The get_likely_street_names function will loop through each record of the grouped results to identify the likely street segment of the trip segment based on the previous and next trip segment information.

Note: The get_likely_street_names function is a work in progress; I will likely be updating it as new data arrives and further data quality review steps are taken.
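As an illustration of the idea only (not the author's actual get_likely_street_names implementation, which is a work in progress), a simple neighbour-vote heuristic over a trip's ordered segment names might look like:

```python
# When a segment's street name disagrees with both neighbours and the
# neighbours agree with each other, adopt the neighbours' name; this
# smooths out short mis-attributed segments at intersections.
def likely_street_names(names):
    cleaned = list(names)
    for i in range(1, len(names) - 1):
        prev_name, cur, next_name = names[i - 1], names[i], names[i + 1]
        if prev_name == next_name and cur != prev_name:
            cleaned[i] = prev_name
    return cleaned
```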

Clean the Trip Attributes

This is done by looping the trip details through the get_likely_street_names function and generating updated summary information.

Combine the Cleaned Results

Combine the geometry values from the spatial overlay with the cleaned attribute data, using the trip unique_id as the primary key. This is a many-to-one left join, as we will be merging the cleaned summary table to the larger geometry subset, which will result in cleaned attributes.
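The many-to-one left join described above can be sketched with plain pandas; the column names unique_id and likely_street are assumptions for illustration:

```python
# Left-merge the cleaned per-trip summary onto the larger geometry
# subset, so every geometry row receives its cleaned attribute.
import pandas as pd

geometry_subset = pd.DataFrame({
    "unique_id": ["t1", "t1", "t2"],
    "length_m": [120.0, 35.0, 80.0],
})
cleaned_summary = pd.DataFrame({
    "unique_id": ["t1", "t2"],
    "likely_street": ["MAIN ST", "FRASER ST"],
})

combined = geometry_subset.merge(cleaned_summary, on="unique_id", how="left")
```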

Check the Results

Using the trip details reviewed earlier, compare the following elements:

Display the cleaned data in a map

A visual review of the data allows us to get an overall understanding of the updated attributes.

Save the results

Because we are using geospatial data, the results need to be saved as GeoJSON, which is similar to a json file except for the definition of the geometry, which enables the data to be accessed directly from geopandas or traditional GIS software (for example, ArcGIS).

For more information on geopandas import/export options, please refer to: GeoPandas Documentation

Link to datasets also available on github

STOP HERE for your EDA Phase 1 assignment. Submit your cleaned data file along with the completed notebook up to this point for grading.

EDA Phase 2

All of your work for the EDA Phase 2 assignment will begin below here. Refer to the detailed instructions and expectations for this assignment in Canvas.